
reviewstack: fix verify addons/ folder github ci #7

Closed
wants to merge 1 commit

Conversation

ahornby
Owner

@ahornby ahornby commented Sep 4, 2023

The container image was missing the gh command-line tool, resulting in CI errors. Tests were then found to be intermittently failing due to logging after completion, so jest options were added.

Test plan:

Run the CI. Before, it was broken with:

```
Error: Unhandled error. (Error: GhNotInstalledError: Error: Command failed with ENOENT: gh api graphql -f searchQuery=repo:facebook/sapling is:pr author:@me -F numToFetch=100 --hostname github.com -f query=
    query YourPullRequestsQuery($searchQuery: String!, $numToFetch: Int!) {
  search(query: $searchQuery, type: ISSUE, first: $numToFetch) {
    nodes {
      ... on PullRequest {
```
Tests were then found to be intermittently failing due to logging after finish, with:

```
  ●  Cannot log after tests are done. Did you forget to wait for something async in your test?
    Attempted to log "Watchman:  Watchman was not found in PATH.  See https://facebook.github.io/watchman/docs/install.html for installation instructions".

      at console.error (../node_modules/@jest/console/build/BufferedConsole.js:163:10)
      at spawnError (node_modules/fb-watchman/index.js:173:13)
      at ChildProcess.<anonymous> (node_modules/fb-watchman/index.js:198:5)
```

Since watch mode was disabled, the `--no-watchman --detectOpenHandles` options were added as well.
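For reference, `--no-watchman` and `--detectOpenHandles` are standard jest CLI flags; a CI invocation using them might look like the line below (the `yarn` script wiring is an assumption, not necessarily how this repo invokes jest):

```shell
# Hypothetical CI invocation; flag names are real jest options.
yarn jest --ci --watchAll=false --no-watchman --detectOpenHandles
```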

After the fix, it is visible that a setTimeout is causing issues in the test.

@ahornby ahornby force-pushed the fix_verify_addons branch 4 times, most recently from c3206f7 to f0f0800 Compare September 4, 2023 17:43
@ahornby ahornby force-pushed the fix_verify_addons branch 8 times, most recently from 5b09395 to 3426897 Compare September 18, 2023 17:50
analytics.test.ts was trying to call the real gh CLI rather than a mock, which caused errors when gh was not installed, and process and handle leaks when it was. Fixed by using a mock execa, as we don't need the real gh to test the analytics tracking.

Jest was also reporting leaked timeouts after the above fix, so added dispose() to debounce.ts and called dispose() from githubCodeReviewProvider.

Found that shared/ depends on reviewstack's generated TextMate grammars, so the tests can't run concurrently without flakiness (failing with a missing import). Fixed test ordering in verify-addons-folder.py.

Tests also log about missing watchman, which triggers a failure due to logs after test exit, even when watchman is disabled. Installed fb-watchman to silence it.

Test plan:

Before, the ./verify-addons-folder.py tests were broken in CI with:
```
Error: Unhandled error. (Error: GhNotInstalledError: Error: Command failed with ENOENT: gh api graphql -f searchQuery=repo:facebook/sapling is:pr author:@me -F numToFetch=100 --hostname github.com -f query=
    query YourPullRequestsQuery($searchQuery: String!, $numToFetch: Int!) {
  search(query: $searchQuery, type: ISSUE, first: $numToFetch) {
    nodes {
      ... on PullRequest {
```
After, it works.
@ahornby ahornby closed this Oct 6, 2023
ahornby pushed a commit that referenced this pull request Jun 8, 2024
Summary:
We have discovered a deadlock in EdenFS in S412223. The deadlock appears when
all fsChannelThreads have a stack trace that looks like:

```
  thread #225, name = 'FsChannelThread', stop reason = signal SIGSTOP
    frame #0: 0x00007f7ca85257b9
    frame #1: 0x0000000007d73aa4 edenfs`bool folly::DynamicBoundedQueue<folly::CPUThreadPoolExecutor::CPUTask, false, false, true, 8ul, 7ul, folly::DefaultWeightFn<folly::CPUThreadPoolExecutor::CPUTask>, std::atomic>::canEnqueue<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>> const&, unsigned long) + 212
    frame #2: 0x0000000007d73557 edenfs`bool folly::DynamicBoundedQueue<folly::CPUThreadPoolExecutor::CPUTask, false, false, true, 8ul, 7ul, folly::DefaultWeightFn<folly::CPUThreadPoolExecutor::CPUTask>, std::atomic>::tryEnqueueUntilSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>, folly::CPUThreadPoolExecutor::CPUTask>(folly::CPUThreadPoolExecutor::CPUTask&&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>> const&) + 39
    frame #3: 0x0000000007d722c0 edenfs`facebook::eden::EdenTaskQueue::add(folly::CPUThreadPoolExecutor::CPUTask) + 576
    frame #4: 0x0000000005172cbd edenfs`folly::CPUThreadPoolExecutor::add(folly::Function<void ()>) + 557
    frame #5: 0x000000000503c4d8 edenfs`void folly::Executor::KeepAlive<folly::Executor>::add<folly::Function<void (folly::Executor::KeepAlive<folly::Executor>&&)>>(folly::Function<void (folly::Executor::KeepAlive<folly::Executor>&&)>&&) && + 216
    frame #6: 0x000000000503c257 edenfs`folly::futures::detail::DeferredExecutor::addFrom(folly::Executor::KeepAlive<folly::Executor>&&, folly::Function<void (folly::Executor::KeepAlive<folly::Executor>&&)>) + 135
    frame #7: 0x000000000503baf5 edenfs`folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State) + 565
    frame #8: 0x00000000052e5e0d edenfs`folly::futures::detail::CoreBase::proxyCallback(folly::futures::detail::State) + 493
    frame #9: 0x00000000052e5bb2 edenfs`folly::futures::detail::CoreBase::setProxy_(folly::futures::detail::CoreBase*) + 66
    ...
```

This stack means that an FsChannelThread is blocked waiting to enqueue a task to
the EdenTaskQueue. The EdenTaskQueue is the task queue that FsChannelThreads
pull work from. So the deadlock is that all FsChannelThreads are blocked trying
to add to a full queue which can only be drained by FsChannelThreads.
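The failure mode can be modeled with a toy bounded queue. EdenFS itself is C++; this TypeScript sketch uses invented names purely for illustration. The essential property is an `enqueue` that blocks while the queue is full, which deadlocks if the only consumers that could drain the queue are themselves blocked in `enqueue`:

```typescript
// Toy bounded task queue, loosely modeling a blocking enqueue like
// EdenTaskQueue's. Names and capacity are illustrative only.
class BoundedQueue<T> {
  private items: T[] = [];
  private waiters: Array<() => void> = [];
  constructor(private capacity: number) {}

  // Blocks (asynchronously) while the queue is full -- the behavior
  // that deadlocks when the blocked producers are also the consumers.
  async enqueue(item: T): Promise<void> {
    while (this.items.length >= this.capacity) {
      await new Promise<void>(resolve => this.waiters.push(resolve));
    }
    this.items.push(item);
  }

  dequeue(): T | undefined {
    const item = this.items.shift();
    const waiter = this.waiters.shift();
    if (waiter) waiter(); // wake one blocked producer
    return item;
  }

  get size(): number {
    return this.items.length;
  }
}
```

If every consumer is itself stuck awaiting `enqueue`, `dequeue` is never called and the waiters never wake -- the same shape as all FsChannelThreads blocking on their own full queue.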

There are two contributing reasons for this happening:
1. The FsChannelThread will queue a bunch of network fetches into the Sapling
queue. When those all finish, the future callback chain runs on the
FsChannelThreads. So the backing store accumulates a bunch of work, and when
the fetches complete it dumps a bunch of tasks onto the FsChannelThread's queue;
the backing store is thus filling up the FsChannelThread task queue. Other
threads could theoretically do this too, but the backing store is the main one
I have seen.

2. The FsChannelThreads might enqueue to their own queues. Folly futures have
some smarts to try to prevent a task running on executor A from enqueueing work
onto executor A (and instead just run the callback inline); see: https://www.internalfb.com/code/fbsource/[c7c20340562d2eab5f5d2f7f45805546687942d9]/fbcode/folly/futures/detail/Core.cpp?lines=147-148.
However, that does not prevent a future that is unknowingly running on an
executor's thread from enqueueing to that thread's executor queue. I believe that
kind of thing happens when we do stuff like this: https://www.internalfb.com/code/fbsource/[824f6dc95f161e141bf9b821a7826c40b570ddc3]/fbcode/eden/fs/inodes/TreeInode.cpp?lines=375-376
The outer lambda is aware which executor it's running on, but the future we start
inside the lambda is not, so when we add that thenValue, it doesn't realize it's
enqueueing to its own executor. I wrote up a toy program to show when folly
will enqueue vs run inline: https://www.internalfb.com/intern/commit/cloud/FBS/bce3a906f53913ab8dc74944b8b50d09d78baf9a.
Script: P1222930602, output: P1222931093. It shows that if you return a future
from a future callback, the next callback is enqueued, and if you attach a
callback to a future returned by another future's callback, that callback is
also enqueued.

So in summary: the backing store fills up the FsChannelThread work queue, then
an FsChannelThread tries to push work onto its own queue and deadlocks.

Potential solutions:
**1 - Queued Immediate Executor.**

This behavior was likely introduced in D51302394, where we moved from the
queued immediate executor to the FsChannelThreads. The queued immediate
executor uses an unbounded queue (see https://www.internalfb.com/code/fbsource/[c7c20340562d2eab5f5d2f7f45805546687942d9]/fbcode/folly/executors/QueuedImmediateExecutor.h?lines=39),
which is why we didn't have this problem before.

I don't love going back to this solution, because we are moving away from the
queued immediate executor precisely because it makes it easy to cause
deadlocks. For example, the one introduced in D50199539.

**2 - Make the queue error instead of block when full.**

We used to do that for the Eden CPU thread pool, and it resulted in a lot of
errors from Eden that caused issues for clients; see D6490979. I think we are
kicking the can down the road if we error rather than block.

**3 - Don't bother switching to the FsChannelThreads to complete the fuse request.**

This is likely going to be the same as 1, unless we are going to undo the
semifuture-ization we have been doing. Or perhaps we could start the fuse
request on the FsChannel threads, then finish it on the Eden CPU threads, which
is pretty much the same thing as this diff except with more sets of threads
involved. So I prefer this change.

**4 - Add backpressure somewhere else.**

If we prevent the backing store and other threads from being able to fill up
the FsChannelThread queue, then it should be impossible for the queue to fill
up, because there would be no fan-out (one task out, then one task in).

However, this is fragile: we could easily introduce fan-out again and end up
back here. Also, this would mean we basically block all requests in the fuse
layer instead of the lower layers of Eden, and we would need to redo the
queueing in the backing store layer.

The fragility and complexity make me dislike this solution.

**5 - Linearize all future chains.**

The reason the FsChannelThreads are enqueueing to their own queues is that we
have nested futures. If we didn't nest, then folly futures would run the
callbacks inline. So if we de-nest all our futures, this issue should
theoretically go away, because the FsChannelThreads would not enqueue to their
own queues. However, I don't know if we can completely get rid of returning a
future in a future callback.

I don't love this solution, as I know there are some explicit places where we
choose to nest (I'm looking at you, PrjFSDispatcher), so it would be VERY
confusing where I am supposed to nest and where I am not. It would be easy to
do the wrong thing and re-introduce this bug. Also, there are a ton of places
we would need to change, and they are not easy to find. So I don't like this
option because it's not very "pit of success" - too fragile and too easy to get
wrong the first time.

**6 - Unbound the FsChannelThread queue.**

It's not great, but there is precedent for this: we unbounded the Eden CPU
thread pool for basically the same reason. See D6513572.

The risk here is that we are opening ourselves up to OOMs. The queue might
grow very large and get EdenFS OOM-killed. We should add a config to this
change so we can roll it out carefully and watch for OOMs as we roll out.
Additionally, long term we likely want to rethink how we do threading to
architect Eden away from this.

I prefer this solution the most. That's what I have implemented here.

---------------------------
Note: I am removing the limit on the number of outstanding fs requests we
process at once in this diff. That config was not exactly working how we wanted
anyway. Queueing in the backing store let us handle essentially unlimited
requests at once, as the Sapling request queue does not have a max size.
I can follow up with a semaphore in the fuse/nfs layers to rate-limit the
number of active requests. Though FWIW, when I do that I will likely bump the
limit a lot, at least initially, since we were realistically allowing clients
to make unlimited requests previously.
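The semaphore idea mentioned above could be sketched as follows. EdenFS is C++, so this TypeScript sketch with invented names only illustrates the mechanism: a counting semaphore caps how many requests are in flight, making callers wait for a permit instead of flooding a queue.

```typescript
// Hypothetical counting semaphore: acquire() waits when no permits
// remain; release() hands a permit to the oldest waiter, if any.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    // No permit available: park until release() wakes us.
    await new Promise<void>(resolve => this.waiters.push(resolve));
  }

  release(): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(); // hand the permit directly to a waiter
    } else {
      this.permits++;
    }
  }
}
```

A fuse/nfs layer using this would `acquire()` before dispatching a request and `release()` when it completes, bounding active requests without a bounded queue that producers can deadlock on.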

Reviewed By: jdelliot, genevievehelsel

Differential Revision: D56553375

fbshipit-source-id: 9c6c8a76bd7c93b00d48654cd5fc31d1a68dc0b2
@ahornby ahornby deleted the fix_verify_addons branch September 22, 2024 22:50